Below is the setup for this class (install packages, load packages, import data).
Libraries we need to install (remember to uncomment before running):
# Set a CRAN mirror
options(repos = c(CRAN = "https://cran.r-project.org"))
# for Part 1:
#install.packages("tidyverse")
#install.packages("palmerpenguins")
# for Part 2:
#install.packages("tmap")
#install.packages("sf")
#install.packages("RColorBrewer")
# for Part 3:
#install.packages("tidytext")
#install.packages("janeaustenr")
#install.packages("magick")
#install.packages("devtools")
Load the packages we need
# for part 1
library(tidyverse)
library(palmerpenguins)
#for part 2
library(tmap)
library(sf)
library(RColorBrewer)
# for part 3
library(tidytext) # to work with unstructured data
library(janeaustenr) # to fetch the dataset
library(magick) # to display images
For this first quick overview we are going to use our dear Palmer
Penguins.
Palmer Penguins is a great dataset for data exploration and visualisation, and generally a really good alternative to iris or mtcars.
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The palmerpenguins package contains two datasets. One is called
penguins, and is a simplified version of the raw dataset,
the second dataset is penguins_raw and contains the raw
data.
- A set of rules to facilitate data visualisation developed by Leland Wilkinson.
- It allows us to think of plot generation by following a step-by-step recipe (and like all good recipes, it can be modified when needed).
- For a full overview, check out Hadley Wickham’s free textbook on using ggplot.
- For inspiration on more complex plots and advanced techniques, Cédric Scherer’s blog is full of great ideas!
The recipe of a nice plot is built layer by layer, starting from ggplot().
The first level is just the data we are going to use
From there, we need to specify the data we need.
We can feed in the data as it is.
However, at this stage our plot is still blank.
# Regular data
ggplot(data = penguins)
aes()
To actually see something, we map variables to visual properties using the aes() command after specifying the data. Let’s first look at the variables available.
head(penguins, 20)
## # A tibble: 20 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## 11 Adelie Torgersen 37.8 17.1 186 3300
## 12 Adelie Torgersen 37.8 17.3 180 3700
## 13 Adelie Torgersen 41.1 17.6 182 3200
## 14 Adelie Torgersen 38.6 21.2 191 3800
## 15 Adelie Torgersen 34.6 21.1 198 4400
## 16 Adelie Torgersen 36.6 17.8 185 3700
## 17 Adelie Torgersen 38.7 19 195 3450
## 18 Adelie Torgersen 42.5 20.7 197 4500
## 19 Adelie Torgersen 34.4 18.4 184 3325
## 20 Adelie Torgersen 46 21.5 194 4200
## # ℹ 2 more variables: sex <fct>, year <int>
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g)) # setting x and y
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g)) +
geom_point()
Now that we can finally see something, let’s refine the coordinates/aesthetics a bit more by playing with colours and shapes.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point()
It is not the best of ideas, but we can also use size to add an additional layer of information.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species, size=flipper_length_mm)) +
geom_point()
NB. If you want colour and size to be connected with a variable (i.e. being part of the legend) you set them up within the aesthetics; otherwise they go in the geometry layer.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point(size=4)
If we want to subdivide the plot into multiple subplots you can use
facet_wrap or facet_grid.
facet_wrap is to be used if you want to subplot by only
one variable.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point()+
facet_wrap(~species)
facet_grid is to be used if you want to create an array across two
variables.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point()+
facet_grid(sex~species)
ggplot(data = penguins,
aes(x = species, y = flipper_length_mm))+
geom_boxplot()+
stat_summary(fun.data = mean_se,
color = "red") # add the mean ± standard error in red
On this level you can set the attributes of the coordinates, change the scale, or apply transformations.
ggplot(data = penguins,
aes(x = species, y = flipper_length_mm))+
geom_boxplot()+
stat_summary(fun.data = mean_se,
color = "red")+
coord_flip()
On this level you can set everything that is not connected directly with the data, from background to colours and axis labels.
Change the background to Black and White
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point(size=3)+
theme_bw()
Add title, subtitle, etc.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point(size=3)+ theme_bw()+
labs(title = "New plot title", subtitle = "A subtitle", caption = "(based on data from ...)", x = "New x label", y= "New y label", color = "Colours")
This is my favourite rabbit hole.
Pre-made colour and theme options also exist:
In the interest of time we are going to see just one simple example, but really the sky is the limit; you can start by having a look at this wiki.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
geom_point(size=3)+ theme_bw()+
labs(title = "New plot title", subtitle = "A subtitle", caption = "(based on data from ...)", x = "New x label", y= "New y label", color = "Colours")+
scale_colour_manual(values = c("darkorange","purple","cyan4"))
N.B. Depending on the graph type it could be fill or colour.
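For instance, boxplots are filled areas rather than points, so the equivalent manual scale is scale_fill_manual (a minimal sketch using the same palette as above):

```r
library(ggplot2)
library(palmerpenguins)

# Boxplots map species to fill, so we use scale_fill_manual
# instead of scale_colour_manual
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"))
```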
Create a visualisation using the penguins dataset that will show the relationship between bill length and bill depth across the different species of penguins and the different sexes. Title the graph “My Penguin Graph” and convert the units of measurement to cm (Tip: you can divide x and y directly in the aesthetic and use the labs level).
ggplot( )
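One possible solution sketch (the facet-by-sex choice and the label text are assumptions; any layout that shows both species and sex works):

```r
library(ggplot2)
library(palmerpenguins)

# Divide mm by 10 directly in the aesthetics to convert to cm
ggplot(penguins, aes(x = bill_length_mm / 10, y = bill_depth_mm / 10, color = species)) +
  geom_point() +
  facet_wrap(~sex) +
  theme_bw() +
  labs(title = "My Penguin Graph",
       x = "Bill length (cm)",
       y = "Bill depth (cm)",
       color = "Species")
```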
Ok, now that we have covered some basics, let’s focus on something more fun: first some geographical plotting and then sentiment analysis results.
For the History of Scotland-focused text analysis, we are going to explore the Statistical Accounts of Scotland. More information on the dataset can be found on the StatAccount Website. The ‘Old’ Statistical Account (1791-99), under the direction of Sir John Sinclair of Ulbster, and the ‘New’ Statistical Account (1834-45) are reports of life in Scotland during the 18th and 19th centuries.
They offer uniquely rich and detailed parish reports for the whole of Scotland, covering a vast range of topics including agriculture, education, trades, religion and social customs.
We are also going to use edited data from the National Records of Scotland
Import the data that we will need
First the CSV containing the text of the StatAccount
Parish <- read_csv("data/parish.csv")
## Rows: 27065 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): title, text, Type, TypeDescriptive, RecordID, Area, Parish
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(Parish)
## title text Type TypeDescriptive
## Length:27065 Length:27065 Length:27065 Length:27065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## RecordID Area Parish
## Length:27065 Length:27065 Length:27065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
Then we import the first GeoPackage. A GeoPackage is an open, standards-based format designed for the efficient storage, transfer, and exchange of geospatial data. Developed by the Open Geospatial Consortium (OGC), it serves as a container for various types of geospatial information, including vector features, raster maps, and attribute data, all within a single file (https://www.geopackage.org/).
The st_read function: * from the sf package, reads vector spatial data. * dsn = data source name, essentially the file name and the folder path.
ParishesGeo <- st_read(dsn = "data/Spatial/Parishes.gpkg")
## Multiple layers are present in data source C:\Users\lmichiel\Documents\GitHub\DH-RSESummerSchool2024\day 1\DataVisWithR\data\Spatial\Parishes.gpkg, reading layer `civilparish_pre1891'.
## Use `st_layers' to list all layer names and their type in a data source.
## Set the `layer' argument in `st_read' to read a particular layer.
## Reading layer `civilparish_pre1891' from data source
## `C:\Users\lmichiel\Documents\GitHub\DH-RSESummerSchool2024\day 1\DataVisWithR\data\Spatial\Parishes.gpkg'
## using driver `GPKG'
## Simple feature collection with 35 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 5550.178 ymin: 530264.1 xmax: 469816.2 ymax: 1220373
## Projected CRS: OSGB36 / British National Grid
plot(ParishesGeo, main = "Scottish Parishes")
As you can see from the plot, the dataset is made up of vector polygons. You can also change the basic presentation, such as the fill colour, line width and border colour.
plot(ParishesGeo,
col = "black",
lwd = 1,
border = "white",
main = "Scottish Parishes")
Besides the parish boundaries, we will also need the location of distilleries across Scotland. Load geo-spatial information for the location of distilleries. This is a vector point dataset.
PointsDistilleries<- st_read(dsn = "data/Spatial/ScottishDistilleries.gpkg")
## Reading layer `scotdistilleries' from data source
## `C:\Users\lmichiel\Documents\GitHub\DH-RSESummerSchool2024\day 1\DataVisWithR\data\Spatial\ScottishDistilleries.gpkg'
## using driver `GPKG'
## Simple feature collection with 109 features and 20 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 126650.9 ymin: 554418.5 xmax: 412088.9 ymax: 1010713
## Projected CRS: OSGB36 / British National Grid
plot(PointsDistilleries,
main = "Scottish Distilleries")
## Warning: plotting the first 10 out of 20 attributes; use max.plot = 20 to plot
## all
We will work first with the vector polygons containing the parish boundaries. At the moment, the vector polygon dataset contains only info from the GeoPackage. To add the info from the parish dataset (i.e. information from the csv file) we need to merge the GeoPackage with it.
Because we want to see how often mentions of a certain topic are present in the text, we want to search for specific keywords.
The first topic we are going to look at is illness. So we are creating a new variable that will contain “yes” if the text contains one of the keywords or “no” if it does not.
1. Search keywords
Parish$Ilness<- ifelse(grepl("ill|ilness|sick|cholera|smallpox|plague|cough|typhoid|fever|measles|dysentery", Parish$text,
ignore.case = T), "yes","no")
head(Parish$Ilness)
## [1] "yes" "yes" "yes" "yes" "no" "yes"
2. Group the results by area. To do this we use a pipe; if you have never seen a pipe before, it is basically a way to perform a series of actions on a dataset in a certain order (you can think of it as bullet points of actions).
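As a minimal illustration of the idea (using the penguins data from Part 1), each step below receives the result of the previous one:

```r
library(tidyverse)
library(palmerpenguins)

penguins %>%
  filter(!is.na(body_mass_g)) %>%           # 1. drop rows with missing mass
  group_by(species) %>%                     # 2. one group per species
  summarise(mean_mass = mean(body_mass_g))  # 3. average mass per group
```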
IlnessGroup <- Parish %>%
group_by(Area) %>%
summarise(Total = n(),
count = sum(Ilness == "yes")) %>%
mutate(per = round(count/Total, 2))
head(IlnessGroup)
## # A tibble: 6 × 4
## Area Total count per
## <chr> <int> <int> <dbl>
## 1 Aberdeen 2193 1790 0.82
## 2 Argyle 1336 1093 0.82
## 3 Ayrshire 1493 1312 0.88
## 4 Banff 820 697 0.85
## 5 Berwick 676 571 0.84
## 6 Bute 147 128 0.87
3. Merge the two datasets
MergedGeo <-merge(ParishesGeo,IlnessGroup,
by.x="JOIN_NAME_",
by.y="Area",
all.x = TRUE) # nb this is a left join because I want to preserve all the records present in ParishesGeo
4. Check that the data have merged properly
head(MergedGeo, max.level = 2)
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 92508.41 ymin: 571189.1 xmax: 414043.2 ymax: 868786.8
## Projected CRS: OSGB36 / British National Grid
## JOIN_NAME_ Total count per geometry
## 1 Aberdeen 2193 1790 0.82 MULTIPOLYGON (((394004.6 80...
## 2 Argyle 1336 1093 0.82 MULTIPOLYGON (((183373.2 73...
## 3 Ayrshire 1493 1312 0.88 MULTIPOLYGON (((218097.2 65...
## 4 Banff 820 697 0.85 MULTIPOLYGON (((316487.5 83...
## 5 Berwick 676 571 0.84 MULTIPOLYGON (((372205.7 66...
## 6 Bute 147 128 0.87 MULTIPOLYGON (((206008.4 63...
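To see exactly what the all.x = TRUE option in merge() does, here is a toy example with two hypothetical data frames: every row of the first is kept, and rows without a match in the second get NA.

```r
# Hypothetical mini versions of ParishesGeo and IlnessGroup
x <- data.frame(JOIN_NAME_ = c("Aberdeen", "Argyle", "Zetland"), id = 1:3)
y <- data.frame(Area = c("Aberdeen", "Argyle"), per = c(0.82, 0.82))

merge(x, y, by.x = "JOIN_NAME_", by.y = "Area", all.x = TRUE)
#   JOIN_NAME_ id  per
# 1   Aberdeen  1 0.82
# 2     Argyle  2 0.82
# 3    Zetland  3   NA
```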
1. Create a continuous color palette
color.palette <- colorRampPalette(c("white", "red"))
tm_shape is a function in the tmap package (Thematic Maps). Thematic maps can be generated with great flexibility. The syntax for creating plots is similar to that of ggplot2, but tailored to maps. To plot a tmap, you first need to specify tm_shape; layers can then be added with the + operator. tm_fill specifies the presentation of the polygons. To differentiate NA values from other valid entries, colorNA is added.
tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
tm_fill("per", palette = color.palette(100), colorNA = "grey") + # Fill polygons based on 'per' variable, using a custom color palette with 100 colors; grey for NA values
tm_borders(col = "black") + # Add black borders to each polygon
tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE) # Set layout: add a title, resize legend text and title, remove frame
Let’s try changing the colour of the filled regions using predefined colours. There are predefined colour palettes you can use directly. Commonly used palettes include rainbow(), heat.colors(), topo.colors(), and terrain.colors(). Beware of the representation of colours: you might need to reverse the colour band to make the representation more intuitive.
tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
tm_fill("per", palette = rev(heat.colors(100)), colorNA = "grey") + # Fill polygons based on 'per' variable, using a reversed heat.colors palette with 100 colors; grey for NA values
tm_borders(col = "black") + # Add black borders to each polygon
tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE) # Set layout: add a title, resize legend text and title, remove frame
You could also change the colour using RColorBrewer
display.brewer.all() # show all the palettes in ColorBrewer
color.palette <- brewer.pal(n = 9, name = "YlOrRd") # create a tailored new palette
We can now replot using the new palette.
tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
tm_fill("per", palette = color.palette, colorNA = "grey") + # Fill polygons based on 'per' variable, using a custom color palette (color.palette); grey for NA values
tm_borders(col = "black") + # Add black borders to each polygon
tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE) # Set layout: add a title, resize legend text and title, remove frame
Try to re-plot the map using a different colour range. Add your code below.
Change the spacing of the intervals. The intervals can be keyed in directly using breaks, and style can be used to change the type of breaks:
1. “fixed”: User-defined fixed breaks.
2. “pretty”: Breaks at pretty intervals (often used for visual appeal).
3. “quantile”: Breaks at quantile intervals (each class has an equal number of observations).
4. “equal”: Breaks at equal intervals.
5. “kmeans”: Breaks determined by k-means clustering.
6. “hclust”: Breaks determined by hierarchical clustering.
7. “bclust”: Breaks determined by bin-based clustering.
8. “fisher”: Breaks determined by Fisher-Jenks natural breaks optimization.
9. “jenks”: Another name for Fisher-Jenks breaks.
10. “sd”: Breaks determined by standard deviations from the mean.
11. “log10_pretty”: Breaks determined by log10 transformed values with pretty intervals.
12. “cont”: Continuous color scale (no discrete breaks).
tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
tm_fill("per", style = "equal", n = 10, palette = color.palette, colorNA = "grey") + # Fill polygons based on 'per' variable; use equal interval classification with 10 classes; custom color palette; grey for NA values
tm_borders(col = "black") + # Add black borders to each polygon
tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE, legend.position = c(1, 0.5)) # Set layout: add a title, resize legend text and title, remove frame, position legend at (1, 0.5)
Try adjusting these values and explore the effects. Write your code below.
The steps are always the same: first we search keywords, and then we merge the results with our map of Scotland.
If you want to try things out yourself, instead of looking at the code below, try to replicate the steps we followed for the illnesses here.
#Parish$witches<-ifelse
Parish$witches<- ifelse(grepl("witch|spell|witches|enchantment|magic", Parish$text, ignore.case = T), "yes","no")
Can you think of other keywords? Just add them to the code above.
Then we group by
WitchGroup <- Parish %>%
group_by(Area) %>%
summarise(Total = n(), count = sum(witches == "yes")) %>%
mutate(per = round(count / Total, 2))
And finally we merge
MergedGeo2 <-merge(ParishesGeo,WitchGroup, by.x="JOIN_NAME_", by.y="Area", all.x = TRUE) # nb this is left join cause I want to preserve all the records present in ParishGeo
Let’s create a more “witchy” palette
color.palette2 <- colorRampPalette(c("white", "purple"))
tm_shape(MergedGeo2) +
tm_fill("per", palette = color.palette2(100), colorNA = "grey") +
tm_borders(col = "black")+
tm_layout(title = "Witchcraft report",
legend.text.size = 0.75,
legend.title.size = 1,
frame = FALSE)
Adding the scale bar and north arrow to the map using tmap is a lot simpler.
tm_shape(MergedGeo2) +
tm_fill("per",
style = "equal",
n = 5,
palette = color.palette2(100),
colorNA = "grey") +
tm_borders(col = "black")+
tm_layout(title = "Witches Reports",
legend.text.size = 0.75,
legend.title.size = 1,
frame = FALSE) +
tm_scale_bar(position = "left") + #add scalebar
tm_compass(size = 1.5)#add north arrow
Let’s connect back to one of the main topics of this week and look at whisky consumption across Scotland.
Unsurprisingly, the first steps remain the same.
If you want to try things out yourself, instead of looking at the code below, try to replicate the previous steps here.
#Parish$Booze<-ifelse
Parish$Booze<- ifelse(grepl("illicit still|illicit distillery|drunk|intemperance|wisky|whisky|whiskey|whysky |alembic",Parish$text, ignore.case = T), "yes","no")
BoozeGroup <- Parish %>%
group_by(Area) %>%
summarise(Total = n(), count = sum(Booze == "yes")) %>%
mutate(per = round(count / Total, 2))
MergedGeo3 <-merge(ParishesGeo,BoozeGroup, by.x="JOIN_NAME_", by.y="Area",all.x = TRUE) # nb this is left join cause I want to preserve all the records present in ParishGeo
color.palette3 <- colorRampPalette(c("white", "Brown"))
tm_shape(MergedGeo3) +
tm_fill("per",
style = "equal",
n = 5,
palette = color.palette3(100),
colorNA = "grey") +
tm_borders(col = "black")+
tm_layout(title = "Whisky Reports",
legend.text.size = 0.75,
legend.title.size = 1,
frame = FALSE) +
tm_scale_bar(position = "left") +
tm_compass(size = 1.5)
Add the second dataset, i.e. the point dataset with the locations of the modern-day distilleries.
tm_shape(MergedGeo3) +
tm_fill("per",
style = "equal",
n = 5,
palette = color.palette3(100),
colorNA = "grey") +
tm_borders(col = "black")+
tm_layout(title = "Whisky Reports",
legend.text.size = 0.75,
legend.title.size = 1,
frame = FALSE) +
tm_scale_bar(position = "left") +
tm_compass(size = 1.5)+
tm_shape(PointsDistilleries) + # we add our new dataset
tm_dots(size=0.1,
col="black", #This time they are dots rather than fill
colorNA = NULL)
We can also use bespoke symbols for the distillery locations and plot it again.
icon <- tmap_icons("data/bottle.png")
tm_shape(MergedGeo3) +
# Fill the polygons based on the "per" attribute
tm_fill("per",
style = "equal", # Use equal interval breaks
n = 5, # Number of classes to divide the data into
palette = color.palette3(100), # Color palette with 100 color levels
colorNA = "grey") + # Color for missing values
# Add borders to the polygons
tm_borders(col = "black") +
# Add another spatial object to the map
tm_shape(PointsDistilleries) +
# Add symbols (icons) for the spatial points
tm_symbols(size = 0.3, # Size of the symbols
clustering = TRUE,
shape = icon,# Symbol shape (specified by the 'icon' variable)
border.lwd = 0) + # Border width of the symbols
# Add layout elements like title and legend settings
tm_layout(title = "Booze Reports", # Title of the map
legend.text.size = 0.75, # Size of the legend text
legend.title.size = 1, # Size of the legend title
frame = FALSE) + # Do not draw a frame around the map
# Add a scale bar to the map
tm_scale_bar(position = "left") + # Position of the scale bar
# Add a compass to the map
tm_compass(size = 1.5) # Size of the compass
Search for a different topic in the dataset and create a new visualisation
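A possible starting point, following the same three steps (the topic and keywords here are just hypothetical examples):

```r
# 1. Search keywords (hypothetical topic: education)
Parish$Education <- ifelse(grepl("school|teacher|education|scholars",
                                 Parish$text, ignore.case = TRUE), "yes", "no")
# 2. Group by area
EducationGroup <- Parish %>%
  group_by(Area) %>%
  summarise(Total = n(), count = sum(Education == "yes")) %>%
  mutate(per = round(count / Total, 2))
# 3. Merge with the parish boundaries (left join, as before)
MergedGeo4 <- merge(ParishesGeo, EducationGroup,
                    by.x = "JOIN_NAME_", by.y = "Area", all.x = TRUE)
```

From here you can reuse the tm_shape / tm_fill recipe above with a palette of your choice.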
Download and look at the bing sentiment library
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
This is a short demo showing how you can create a .gif recording the evolution of sentiment across Jane Austen’s books. To do so we need to follow these steps.
For each word we are going to collect information about where it appears: its line number and its chapter.
AustenTable <- austen_books() %>% #create a new file named AustenTable that will extract info from austen_books
group_by(book) %>% # group by every single book then
mutate( # manipulate the data to create
linenumber = row_number(), # a line number column that would count in which row the word was
chapter = cumsum(str_detect(text, # the chapter number. We can do so by using regex and find lines starting with chapter followed by a space and a letter
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>% # This line removes the grouping, so subsequent operations will be applied to the entire dataset rather than grouped subsets.
unnest_tokens(word, text) # This tokenises the text column, splitting it into individual words and creating a new row for each word
head(AustenTable)
## # A tibble: 6 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
Now that we have the list of all words, we extract the average sentiment of subsets of each chapter of each book. To do so, as before, we manipulate our dataset step by step.
Because we want to create a uniform pattern simulating a tapestry, we divide each chapter into equal collections of words.
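The rep() trick used in the next code block can be seen on a toy vector: for a chapter with n sentiment words, each index value is repeated ceiling(n / 10) times, and length.out trims the sequence to exactly n entries.

```r
n <- 23  # imagine a chapter with 23 sentiment words
rep(1:10, each = ceiling(n / 10), length.out = n)
#  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8
```

Note that the chapter is split into at most 10 segments; short chapters may use fewer.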
jane_austen_sentiment <- AustenTable %>% # Load Jane Austen's books dataset and start a chain of operations to create a Jane_austen_sentiment dataset
inner_join(get_sentiments("bing"), relationship = 'many-to-many') %>% # Join the dataset with the Bing lexicon sentiment dictionary
group_by(book, chapter) %>% # Group the dataset by book and chapter
mutate(index = rep(1:10, each = ceiling(n() / 10), length.out = n())) %>% # Create an index that splits each chapter into 10 segments, no matter how long the chapter is
group_by(book, chapter, index) %>% # Regroup the dataset by book, chapter, and index
count(sentiment) %>% # Count the occurrences of each sentiment within each segment
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% # Reshape the data from long to wide format
mutate(sentiment = positive - negative,index= as.factor(index))%>% # Calculate sentiment score (positive - negative) for each segment
filter(!chapter=="0") # Filter out chapters with the value "0" (if any)
## Joining with `by = join_by(word)`
head(jane_austen_sentiment)
## # A tibble: 6 × 6
## # Groups: book, chapter, index [6]
## book chapter index negative positive sentiment
## <fct> <int> <fct> <int> <int> <int>
## 1 Sense & Sensibility 1 1 2 11 9
## 2 Sense & Sensibility 1 2 4 9 5
## 3 Sense & Sensibility 1 3 7 6 -1
## 4 Sense & Sensibility 1 4 3 10 7
## 5 Sense & Sensibility 1 5 4 9 5
## 6 Sense & Sensibility 1 6 3 10 7
Plot a graph for each book that will show the sentiment-values tapestry across the book. To do so we need to take a series of small steps.
dir_out <- file.path("outputs/Austen") # Define the directory path where the outputs will be saved
dir.create(dir_out, recursive = TRUE) # Create the directory if it doesn't already exist
## Warning in dir.create(dir_out, recursive = TRUE): 'outputs\Austen' already
## exists
books <- unique(jane_austen_sentiment$book)
books
## [1] Sense & Sensibility Pride & Prejudice Mansfield Park
## [4] Emma Northanger Abbey Persuasion
## 6 Levels: Sense & Sensibility Pride & Prejudice Mansfield Park ... Persuasion
most_chapter <- max(jane_austen_sentiment$chapter, na.rm = TRUE)# Find the maximum chapter number in the dataset 'jane_austen_sentiment'
most_chapter
## [1] 61
Now we have all that we need to create a for loop that will automatically create a graph for each book. A “for” loop is a control flow statement in programming languages that allows you to repeatedly execute a block of code a specified number of times or iterate over a sequence of values.
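Before the full loop below, a minimal sketch of the pattern:

```r
# A minimal for loop: the body runs once per element of the vector
for (b in c("Emma", "Persuasion")) {
  message("Now plotting: ", b)
}
```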
for (y in books) { # Iterate over each book in the 'books' vector; y is the name we give to the iteration variable, it can be anything as long as you are consistent
p <- # p is just a name we are giving to the plot; again, you can change it as long as you are consistent
jane_austen_sentiment %>%
filter(book == y) %>% # Filter the 'jane_austen_sentiment' dataset for the current book
ggplot(aes(chapter,index, fill= sentiment)) + # Create a ggplot object with chapter, index, and sentiment as aesthetics
geom_tile() +# Add a tile layer to create a heatmap
scale_x_continuous(breaks=seq(1,most_chapter,1), expand = c(0,0))+ # Customise x-axis scale to show breaks from 1 to 'most_chapter'
scale_fill_gradient(low="blue", high="red", limits = c(-20, 40))+# Customise fill scale to use a gradient from blue to red
theme_bw()+ # Apply a black-and-white theme
guides(fill="none")+ # Remove the fill legend
ggtitle(y)+ # Add a title to the plot with the current book's name
coord_fixed(ratio = 1, ylim = c(10,1), xlim = c(0.5,most_chapter+0.5))+ # Fix the aspect ratio and set limits for y and x axes
theme( # Customize theme to remove y-axis labels and ticks
axis.title.y = element_blank(),
axis.text.y= element_blank(),
axis.ticks.y = element_blank()
)
fp <- file.path(dir_out, paste0(y, ".png"))# Define the file path where the plot will be saved
ggsave(plot = p, # Save the ggplot object as a PNG file
filename = fp, # File path where the plot will be saved
device = "png", # Output device type (PNG format)
width=3500,# Width of the output in pixels
height = 1000, # Height of the output in pixels
units = "px") # Units of width and height (pixels)
}# Close the loop
The bit of code below is just to look at one of the plots created.
Image <- image_read('outputs/Austen/Emma.png')
Image
Good! We are almost there; now we need to create a gif out of the single plots.
imgs <- list.files(dir_out, full.names = TRUE) # List all file names in the directory 'dir_out' and store them in 'imgs'
img_list <- lapply(imgs, image_read) # Read each image file from the list of file names using 'image_read' and store them in 'img_list'
img_joined <- image_join(img_list) # Join the list of images into a single animated image using 'image_join'
img_animated <- image_animate(img_joined, fps = 1) # Create an animated image from the joined image with a frame rate of 1 frame per second using 'image_animate'
image_write(image = img_animated,
path = "outputs/austen.gif")
Let’s look at what we have done
img_animated
Create a similar visualisation using a different dataset. You can have a look at what is directly available in R here.
Hint version using Sherlock books
Solution
devtools::install_github("EmilHvitfeldt/sherlock")
## Using GitHub PAT from the git credential store.
## Skipping install of 'sherlock' from a github remote, the SHA1 (38584034) has not changed since last install.
## Use `force = TRUE` to force installation
library(sherlock)
SherlockTable <- holmes %>% # create a new table named SherlockTable that will extract info from holmes
group_by(book) %>% # group by every single book then
mutate( # manipulate the data to create
linenumber = row_number(), # a line number column that would count in which row the word was
chapter = cumsum(str_detect(text, # the chapter number. We can do so by using regex and find lines starting with chapter followed by a space and a letter
regex("^CHAPTER",
ignore_case = TRUE)))) %>%
ungroup() %>% # This line removes the grouping, so subsequent operations will be applied to the entire dataset rather than grouped subsets.
unnest_tokens(word, text) # This tokenises the text column, splitting it into individual words and creating a new row for each word
head(SherlockTable)
## # A tibble: 6 × 4
## book linenumber chapter word
## <chr> <int> <int> <chr>
## 1 A Study In Scarlet 1 0 a
## 2 A Study In Scarlet 1 0 study
## 3 A Study In Scarlet 1 0 in
## 4 A Study In Scarlet 1 0 scarlet
## 5 A Study In Scarlet 3 0 table
## 6 A Study In Scarlet 3 0 of
Sherlock_sentiment <- SherlockTable %>%
inner_join(get_sentiments("bing"), relationship = 'many-to-many') %>% # Join the dataset with the Bing lexicon sentiment dictionary
group_by(book, chapter) %>% # Group the dataset by book and chapter
mutate(index = rep(1:10, each = ceiling(n() / 10), length.out = n())) %>% # Create an index that splits each chapter into 10 segments, no matter how long the chapter is
group_by(book, chapter, index) %>% # Regroup the dataset by book, chapter, and index
count(sentiment) %>% # Count the occurrences of each sentiment within each segment
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% # Reshape the data from long to wide format
mutate(sentiment = positive - negative,index= as.factor(index))%>% # Calculate sentiment score (positive - negative) for each segment
filter(!chapter=="0") # Filter out chapters with the value "0" (if any)
## Joining with `by = join_by(word)`
head(Sherlock_sentiment)
## # A tibble: 6 × 6
## # Groups: book, chapter, index [6]
## book chapter index negative positive sentiment
## <chr> <int> <fct> <int> <int> <int>
## 1 A Scandal in Bohemia 4 1 8 13 5
## 2 A Scandal in Bohemia 4 2 9 12 3
## 3 A Scandal in Bohemia 4 3 11 10 -1
## 4 A Scandal in Bohemia 4 4 9 12 3
## 5 A Scandal in Bohemia 4 5 11 10 -1
## 6 A Scandal in Bohemia 4 6 9 12 3
dir_outS <- file.path("outputs/Sherlock") # Define the directory path where the outputs will be saved
dir.create(dir_outS, recursive = TRUE) # Create the directory if it doesn't already exist
## Warning in dir.create(dir_outS, recursive = TRUE): 'outputs\Sherlock' already
## exists
books2 <- unique(Sherlock_sentiment$book)
books2
## [1] "A Scandal in Bohemia" "A Study In Scarlet"
## [3] "The Adventure of Wisteria Lodge" "The Adventure of the Red Circle"
## [5] "The Hound of the Baskervilles" "The Sign of the Four"
## [7] "The Valley Of Fear"
most_chapter <- max(Sherlock_sentiment$chapter, na.rm = TRUE)# Find the maximum chapter number in the dataset
most_chapter
## [1] 15
for (y in books2) { # Iterate over each book in the 'books2' vector; y is the name we give to the iteration variable, it can be anything as long as you are consistent
p <- # p is just a name we are giving to the plot; again, you can change it as long as you are consistent
Sherlock_sentiment %>%
filter(book == y) %>%
ggplot(aes(chapter,index, fill= sentiment)) + # Create a ggplot object with chapter, index, and sentiment as aesthetics
geom_tile() +# Add a tile layer to create a heatmap
scale_x_continuous(breaks=seq(1,most_chapter,1), expand = c(0,0))+ # Customise x-axis scale to show breaks from 1 to 'most_chapter'
scale_fill_gradient(low="blue", high="red", limits = c(-20, 40))+# Customise fill scale to use a gradient from blue to red
theme_bw()+ # Apply a black-and-white theme
guides(fill="none")+ # Remove the fill legend
ggtitle(y)+ # Add a title to the plot with the current book's name
coord_fixed(ratio = 1, ylim = c(10,1), xlim = c(0.5,most_chapter+0.5))+ # Fix the aspect ratio and set limits for y and x axes
theme( # Customize theme to remove y-axis labels and ticks
axis.title.y = element_blank(),
axis.text.y= element_blank(),
axis.ticks.y = element_blank()
)
fp <- file.path(dir_outS, paste0(y, ".png"))# Define the file path where the plot will be saved
ggsave(plot = p, # Save the ggplot object as a PNG file
filename = fp, # File path where the plot will be saved
device = "png", # Output device type (PNG format)
width=3500,# Width of the output in pixels
height = 1000, # Height of the output in pixels
units = "px") # Units of width and height (pixels)
}# Close the loop
imgs <- list.files(dir_outS, full.names = TRUE) # List all file names in the directory 'dir_out' and store them in 'imgs'
img_list <- lapply(imgs, image_read) # Read each image file from the list of file names using 'image_read' and store them in 'img_list'
img_joined <- image_join(img_list) # Join the list of images into a single animated image using 'image_join'
img_animated <- image_animate(img_joined, fps = 1) # Create an animated image from the joined image with a frame rate of 1 frame per second
image_write(image = img_animated,
path = "outputs/Sherlock.gif")
img_animated
THE END